Skip to content

feat: add retry utility with exponential backoff#10

Open
jesseturner21 wants to merge 1 commit into
mainfrom
test/retry-utility
Open

feat: add retry utility with exponential backoff#10
jesseturner21 wants to merge 1 commit into
mainfrom
test/retry-utility

Conversation

@jesseturner21

Copy link
Copy Markdown
Owner

Summary

  • Adds withRetry() async helper with exponential backoff and jitter
  • Supports configurable max retries, delay bounds, and retryable error filtering
  • Adds parseRetryAfter() for parsing HTTP Retry-After headers

Test plan

  • Unit tests for retry logic
  • Integration with API calls that return throttling errors

@github-actions github-actions Bot added the size/s PR size: S label Apr 23, 2026
@jesseturner21 jesseturner21 reopened this Apr 23, 2026
@github-actions github-actions Bot added size/s PR size: S and removed size/s PR size: S labels Apr 23, 2026
Adds a generic retry helper for transient failures like API throttling
and network errors. Supports configurable max retries, exponential
backoff with jitter, and retryable error filtering.
@jesseturner21 jesseturner21 reopened this Apr 23, 2026
@github-actions github-actions Bot added size/s PR size: S ai-reviewing AI review in progress and removed size/s PR size: S labels Apr 23, 2026
@jesseturner21

Copy link
Copy Markdown
Owner Author

Bug: maxRetries semantics are off-by-one.

The loop is for (let attempt = 0; attempt < opts.maxRetries; attempt++), which means maxRetries: 3 results in 3 total attempts (1 initial + 2 retries), not 1 initial + 3 retries as the name implies. Most retry libraries (e.g., AWS SDK, p-retry) treat maxRetries as the number of retries after the initial attempt.

Options to fix:

  1. Change the loop bound to attempt <= opts.maxRetries so maxRetries: 3 gives 1 + 3 = 4 attempts.
  2. Rename the field to maxAttempts to make the current behavior unambiguous.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Bug: sleeps after the final failed attempt.

On the last iteration of the loop, if fn() throws, the code still executes await sleep(delay) before falling out of the loop and throwing lastError. With the default maxDelayMs: 30000, this can add up to 30s of pointless delay before the caller sees the error.

Fix: skip the sleep when attempt === opts.maxRetries - 1 (or whatever the final-attempt check becomes after fixing the off-by-one).

@jesseturner21

Copy link
Copy Markdown
Owner Author

PR description says "exponential backoff and jitter" but there is no jitter.

The delay calculation Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs) is fully deterministic. Without jitter, concurrent retriers from multiple callers/processes will synchronize and hammer the API on the same schedule — which defeats the main reason to add jitter when dealing with throttling.

Options:

  1. Add full jitter: delay = Math.random() * Math.min(baseDelayMs * 2^attempt, maxDelayMs).
  2. Add equal jitter (half fixed, half random).
  3. Update the PR description to remove the "jitter" claim if you intend to add it in a follow-up.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Bug: parseRetryAfter can return NaN or a negative number.

const date = new Date(headerValue);
return date.getTime() - Date.now();
  • If headerValue is not a valid number and not a valid HTTP date, date.getTime() returns NaN, so the function returns NaN.
  • If the date is in the past (clock skew, stale response), the result is negative.

Either can end up passed to setTimeout, which will then fire immediately — but callers that use the value for logic (logging, upper-bound checks) will misbehave. Please validate the result and return something sensible (e.g., 0, or throw) for unparseable / negative values. Also consider passing a radix to parseInt (parseInt(headerValue, 10)).

@jesseturner21

Copy link
Copy Markdown
Owner Author

parseRetryAfter is exported but never consumed by withRetry.

The stated use case in the PR description is "API calls that return throttling errors," which is exactly where Retry-After headers matter. As written, a caller that wants to honor Retry-After has to implement their own retry loop — at which point withRetry isn't helping. Consider either:

  1. Letting fn signal a retry delay (e.g., by throwing an error with a retryAfterMs property, which withRetry then uses instead of the exponential delay).
  2. Accepting a getRetryDelay?: (err) => number | undefined option.

Otherwise the two pieces of this PR don't actually compose.

@jesseturner21

Copy link
Copy Markdown
Owner Author

retryableErrors matches against error.message via .includes(), which is unreliable for AWS SDK errors.

AWS SDK v3 exposes error codes on error.name (and sometimes error.$metadata), not in the message string. For example, a ThrottlingException from Bedrock/AgentCore typically has err.name === 'ThrottlingException' with a message like "Rate exceeded"err.message.includes('ThrottlingException') would be false.

Given the PR's stated goal ("Integration with API calls that return throttling errors"), this is likely to silently not retry the exact errors it's meant to handle.

Options:

  1. Match against err.name (and/or err.$metadata?.httpStatusCode) in addition to / instead of err.message.
  2. Replace retryableErrors: string[] with a predicate isRetryable?: (err: unknown) => boolean so callers decide.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Missing: not exported from src/lib/utils/index.ts, and no unit tests.

  • Every other util in src/lib/utils/ is re-exported from index.ts (aws-account, credentials, env, platform, subprocess, time-parser, json-rpc, zod). retry.ts should be added there so it's importable via the barrel consistently.
  • Every other util has a sibling test under __tests__/. The PR's own test plan lists "Unit tests for retry logic" as an unchecked item — please add src/lib/utils/__tests__/retry.test.ts covering the success-after-retry, exhaustion, non-retryable-error, and parseRetryAfter (numeric, date, invalid) paths before merging.

@github-actions github-actions Bot added ai-reviewing AI review in progress and removed ai-reviewing AI review in progress labels Apr 23, 2026
@jesseturner21

Copy link
Copy Markdown
Owner Author

Missing jitter despite PR description claiming it

The PR summary says "exponential backoff and jitter", but the delay calculation is fully deterministic:

const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);

With no jitter, if many callers hit the same throttled API at once they will all retry in lock-step, which is exactly the thundering-herd problem jitter is meant to solve. Options:

  1. Full jitter: const delay = Math.random() * Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
  2. Equal jitter: half fixed + half random of the computed delay.
  3. Drop the word "jitter" from the PR description / docstring if you intentionally want deterministic backoff.

See https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ for the tradeoffs.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Unnecessary sleep after the final attempt

On the last iteration of the loop, if fn() throws, the code still calls await sleep(delay) before exiting the loop and throwing. On a call with maxRetries: 3 and maxDelayMs: 30000, a caller whose operation will ultimately fail waits up to an extra 4s (2^2 * 1000) for nothing.

Fix: skip the sleep on the final attempt, e.g.

for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
  try {
    return await fn();
  } catch (err) {
    lastError = err as Error;
    // ... retryable check ...
    if (attempt === opts.maxRetries - 1) break;
    const delay = ...;
    await sleep(delay);
  }
}
throw lastError;

@jesseturner21

Copy link
Copy Markdown
Owner Author

parseRetryAfter can return NaN or negative numbers

export function parseRetryAfter(headerValue: string): number {
  const seconds = parseInt(headerValue);
  if (!isNaN(seconds)) {
    return seconds * 1000;
  }
  const date = new Date(headerValue);
  return date.getTime() - Date.now();
}

Two problems for callers that pass this value into setTimeout/sleep:

  1. If headerValue is neither a number nor a valid HTTP date, new Date(headerValue).getTime() is NaN, so the function returns NaN. setTimeout(fn, NaN) fires immediately — likely not what a caller relying on a retry delay expects.
  2. If the HTTP date is in the past (which servers sometimes send due to clock skew), this returns a negative number.

Also minor: parseInt(headerValue) is missing a radix and will happily accept "30abc" as 30. Per RFC 7231 the delay-seconds form must be all digits — consider /^\d+$/.test(headerValue.trim()) before parsing, or use Number(headerValue).

Suggested fix: clamp to >= 0 and return a sentinel (or 0) on unparseable input, e.g.

export function parseRetryAfter(headerValue: string): number {
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const ms = new Date(trimmed).getTime() - Date.now();
  return Number.isFinite(ms) ? Math.max(0, ms) : 0;
}

@github-actions github-actions Bot removed the ai-reviewing AI review in progress label Apr 23, 2026
@jesseturner21

Copy link
Copy Markdown
Owner Author

Missing jitter despite being advertised

The PR summary says this adds "exponential backoff and jitter", but the delay calculation has no jitter:

const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);

If multiple clients hit a throttling error simultaneously, they will all retry in lockstep at exactly 1s, 2s, 4s, etc., which defeats much of the purpose of backoff. Options:

  1. Full jitter: const delay = Math.random() * Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
  2. Equal jitter: const cap = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs); const delay = cap / 2 + Math.random() * (cap / 2);
  3. Remove "and jitter" from the PR description and docstring if it's intentionally deferred.

@jesseturner21

Copy link
Copy Markdown
Owner Author

maxRetries semantics are off by one

The field is named maxRetries (retries implies additional attempts after the first), but the loop uses it as the total attempt count:

for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
  try {
    return await fn();
  } ...
}

With the default maxRetries: 3, a caller would reasonably expect 1 initial call + 3 retries = 4 total attempts, but they get only 3 total attempts (1 initial + 2 retries). Options:

  1. Change the loop to attempt <= opts.maxRetries so it does 1 + maxRetries total attempts.
  2. Rename the option to maxAttempts to match the current behavior.

Whichever you pick, please update the JSDoc so callers know which semantic applies.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Unnecessary sleep on the final attempt

On the last loop iteration, when fn() throws, the code still calls await sleep(delay) before exiting the loop and throwing lastError. That wastes up to maxDelayMs (default 30s) after we've already decided to give up.

Guard the sleep so it only runs when another attempt will follow, e.g.:

if (attempt < opts.maxRetries - 1) {
  const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
  await sleep(delay);
}

@jesseturner21

Copy link
Copy Markdown
Owner Author

parseRetryAfter can return negative numbers or NaN

export function parseRetryAfter(headerValue: string): number {
  const seconds = parseInt(headerValue);
  if (!isNaN(seconds)) {
    return seconds * 1000;
  }
  const date = new Date(headerValue);
  return date.getTime() - Date.now();
}

Two problems that will cause setTimeout to behave unexpectedly if the result is passed to sleep():

  1. If headerValue is an HTTP date in the past, the return value is negative.
  2. If headerValue is neither a valid integer nor a parseable date, date.getTime() is NaN, so the return value is NaN.

Suggest clamping and handling the invalid case, e.g.:

const ms = date.getTime() - Date.now();
return Number.isFinite(ms) ? Math.max(0, ms) : 0;

and similarly Math.max(0, seconds * 1000) for the numeric branch. Also, parseInt should pass an explicit radix (parseInt(headerValue, 10)).

@github-actions

Copy link
Copy Markdown

Coverage Report

Status Category Percentage Covered / Total
🔵 Lines 41.98% 7231 / 17221
🔵 Statements 41.39% 7664 / 18513
🔵 Functions 40.04% 1282 / 3201
🔵 Branches 40.46% 4824 / 11920
Generated in workflow #32 for commit 551bb0d by the Vitest Coverage Report Action

@github-actions github-actions Bot added the ai-reviewing AI review in progress label Apr 23, 2026
@jesseturner21

Copy link
Copy Markdown
Owner Author

Issue: Sleeps after the final failed attempt before throwing.

In withRetry, the loop always calls await sleep(delay) in the catch block, including on the final iteration. After the last attempt fails, this causes the function to sleep (up to maxDelayMs, which defaults to 30 seconds) before throwing lastError. Callers will see an unexpected delay after all retries are exhausted.

for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
  try {
    return await fn();
  } catch (err) {
    lastError = err as Error;
    // ... retryability check ...
    const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
    await sleep(delay);  // <-- runs even on the last attempt
  }
}
throw lastError;

Fix options:

  1. Skip the sleep on the final iteration: if (attempt < opts.maxRetries - 1) await sleep(delay);
  2. Restructure so the sleep happens at the top of each retry iteration rather than at the bottom.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Issue: No jitter is applied, despite the PR description claiming "exponential backoff and jitter".

The delay calculation is purely deterministic:

const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);

There is no randomization, so multiple concurrent callers hitting the same throttled API will all retry in lockstep — which is exactly the thundering herd problem jitter is supposed to prevent. Either add jitter or update the PR description/docstring to not claim jitter.

Common options:

  1. Full jitter (recommended by AWS): const delay = Math.random() * Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs);
  2. Equal jitter: const cap = Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs); const delay = cap / 2 + Math.random() * (cap / 2);

@jesseturner21

Copy link
Copy Markdown
Owner Author

Issue: maxRetries semantics are ambiguous/off-by-one.

The loop runs for (let attempt = 0; attempt < opts.maxRetries; attempt++), so with the default maxRetries: 3 the function performs 3 total attempts (1 initial + 2 retries), not 3 retries. In most retry libraries (p-retry, AWS SDK retry strategies, etc.) "retries" means attempts after the initial call, so maxRetries: 3 would normally mean 4 total attempts.

Fix options:

  1. Rename the field to maxAttempts to match the current behavior.
  2. Change the loop to attempt <= opts.maxRetries (and adjust the backoff exponent accordingly) so maxRetries actually means the number of retries.

Either is fine, but the name and behavior should agree — otherwise callers will mis-configure it.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Issue: parseRetryAfter can silently return NaN or negative values for bad input.

export function parseRetryAfter(headerValue: string): number {
  const seconds = parseInt(headerValue);
  if (!isNaN(seconds)) {
    return seconds * 1000;
  }
  const date = new Date(headerValue);
  return date.getTime() - Date.now();
}

Problems:

  1. If headerValue is not a valid number or a valid HTTP date (e.g. "", "abc"), date.getTime() is NaN and the function returns NaN. Passing NaN to setTimeout is treated as 0, but it's a surprising return value for a function typed as number.
  2. If the HTTP date is in the past, the function returns a negative number.
  3. parseInt is called without a radix and will happily parse "10abc" as 10.

Fix options:

  1. Clamp the result: return Math.max(0, isNaN(ms) ? 0 : ms); and return 0 (or some sentinel) on unparseable input.
  2. Return number | null / number | undefined and have callers decide the fallback.
  3. At minimum, validate with Number(headerValue) (or parseInt(headerValue, 10)) and guard against NaN from the date branch.

@github-actions github-actions Bot added ai-reviewing AI review in progress and removed ai-reviewing AI review in progress labels Apr 23, 2026
Comment thread src/lib/utils/retry.ts
const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
await sleep(delay);
}
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing jitter. The PR description says "exponential backoff and jitter," but the delay calculation is purely deterministic:

const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);

Without jitter, concurrent callers that fail at the same time (e.g., a burst of API calls hitting a throttling limit) will all retry in lockstep, defeating much of the point of backoff for throttling scenarios. Options:

  1. Add full jitter: const delay = Math.random() * Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
  2. Add equal/decorrelated jitter (see the classic AWS Architecture Blog post on backoff/jitter).
  3. If jitter isn't actually wanted, update the doc comment and PR description to drop the claim.

Comment thread src/lib/utils/retry.ts
await sleep(delay);
}
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary sleep after the final attempt. On the last iteration (attempt === maxRetries - 1), if fn() throws, the loop still runs await sleep(delay) before exiting and throwing lastError. That's wasted wall-clock time (up to maxDelayMs, i.e. 30s with defaults) with no retry following it.

Fix: skip the sleep on the last attempt, e.g.:

if (attempt < opts.maxRetries - 1) {
  const delay = /* ... */;
  await sleep(delay);
}

Comment thread src/lib/utils/retry.ts

// Try parsing as HTTP date
const date = new Date(headerValue);
return date.getTime() - Date.now();

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseRetryAfter can return negative or NaN milliseconds.

  • If headerValue is an HTTP date in the past (e.g., a slightly skewed clock), date.getTime() - Date.now() is negative.
  • If headerValue is neither a number nor a parseable date, new Date(headerValue).getTime() is NaN, so the return value is NaN.

Callers are likely to pass the result straight to sleep() / setTimeout, where NaN or negatives get coerced to 0 (fine) or can cause bugs downstream in logic that compares against maxDelayMs, etc.

Suggested fix: clamp to >= 0 and fall back on invalid input, e.g.:

export function parseRetryAfter(headerValue: string): number {
  const seconds = parseInt(headerValue, 10);
  if (!isNaN(seconds)) return Math.max(0, seconds * 1000);

  const ms = new Date(headerValue).getTime() - Date.now();
  return Number.isFinite(ms) ? Math.max(0, ms) : 0;
}

(Also, pass a radix to parseInt to satisfy common lint rules.)

Comment thread src/lib/utils/retry.ts
let lastError: Error | undefined;

for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
try {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxRetries semantics are misleading. With the current loop (for (let attempt = 0; attempt < opts.maxRetries; attempt++)), maxRetries: 3 means 3 total attempts — i.e., 1 initial call + 2 retries — not 1 initial call + 3 retries as the name implies. Callers reading the type will almost certainly pick the wrong value.

Pick one of:

  1. Rename the option to maxAttempts so the field matches the behavior.
  2. Keep the maxRetries name and change the loop to attempt <= opts.maxRetries (so maxRetries: 3 → 4 total attempts).

Either is fine, but the current name/behavior mismatch is a foot-gun worth resolving before the util gets adopted.

@github-actions github-actions Bot removed the ai-reviewing AI review in progress label Apr 23, 2026
@jesseturner21

Copy link
Copy Markdown
Owner Author

Bug: sleeps after the final failed attempt before throwing

In withRetry, the await sleep(delay) runs unconditionally inside the catch block. On the last iteration of the loop, this causes the function to wait (up to maxDelayMs) before throwing the error, even though no further retry will happen. This wastes time and can make failures feel hung.

Options to fix:

  1. Skip the sleep on the last iteration, e.g. if (attempt < opts.maxRetries - 1) await sleep(delay);
  2. Restructure the loop so the sleep happens at the start of the next attempt instead of after the catch.

@jesseturner21

Copy link
Copy Markdown
Owner Author

Missing jitter despite being advertised

The PR summary says withRetry provides "exponential backoff and jitter", but the implementation only does exponential backoff — there's no jitter:

const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt), opts.maxDelayMs);

Without jitter, concurrent callers that hit a throttling error will all retry in lockstep, which defeats much of the purpose for API throttling scenarios (one of the stated use cases). Please either add jitter (e.g. delay * (0.5 + Math.random() * 0.5) for "equal jitter", or Math.random() * delay for "full jitter") or update the PR description / doc comment to not claim jitter.

@jesseturner21

Copy link
Copy Markdown
Owner Author

parseRetryAfter can return NaN or negative values silently

Two related issues in parseRetryAfter:

  1. If headerValue isn't a number and isn't a valid HTTP date, new Date(headerValue).getTime() returns NaN, and the function returns NaN. Callers passing this to setTimeout/sleep will get setTimeout(fn, NaN) which is treated as 0 — silent misbehavior.
  2. If the HTTP date is in the past (clock skew, stale header), the function returns a negative number.
  3. Minor: parseInt(headerValue) is missing an explicit radix (parseInt(headerValue, 10)).

Options to fix:

  1. Validate and clamp: check isNaN on the date result, and return a sensible fallback (e.g. 0, or throw, or return undefined and have the caller decide).
  2. At minimum, clamp the return value to >= 0 and document/handle the invalid-input case explicitly.

@jesseturner21

Copy link
Copy Markdown
Owner Author

No tests added

The PR's own test plan lists "Unit tests for retry logic" and "Integration with API calls that return throttling errors" as unchecked items, and no test file was added (the rest of src/lib/utils has a __tests__ directory with tests for each utility). Given the subtle behavior here (retry counts, delay on final failure, retryable-error filtering, parseRetryAfter edge cases), unit tests should land with this change rather than as a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/s PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant